Methodology


Project Structure: A workflowr Project

The project’s folder structure follows the standard for reproducible research in R and uses the workflowr package. This package is used by many statisticians to write reproducible publications.

myproject/
├── .gitignore
├── .Rprofile
├── _workflowr.yml
├── analysis/
│   ├── about.Rmd
│   ├── index.Rmd
│   ├── license.Rmd
│   └── _site.yml
├── code/
│   └── README.md
├── data/
│   └── README.md
├── docs/
├── myproject.Rproj
├── output/
│   └── README.md
└── README.md

Below, we describe each folder and the function it serves:

The two main subdirectories are analysis/ and docs/. These are the directories where the analysis is written and where it is compiled.

The helper directories are data/, code/, and output/. These directories are suggestions for organizing your data analysis project, but can be removed if you do not find them useful.

Data Wrangling and Feature Engineering Scripts: scripts that are not part of the modelling itself.

var_dict.R: This R script contains a dictionary with each variable’s original name, its given (human-readable) name, and whether it is selected for the analysis. The purpose of this dictionary is to give a general overview of all variables present in the data.

setup.R & setup.py: R and Python scripts that contain all the libraries and imports required by the project.

functions.R & functions.py: R and Python scripts that contain miscellaneous functions used in the analysis.

Modelling Scripts: scripts that contain the algorithms used for modelling.

GBClassifier.py: A multiclass classification algorithm based on gradient boosting. The main class, ClassifierModel, takes an algorithm, fits it to the training data, and prints general accuracy measures such as the confusion matrix and the accuracy. The underlying multiclass algorithm is supplied by a model wrapper, implemented in separate classes such as XGBWrapper, LGBWrapper, and CatBoostWrapper. Finally, ClassifierModel calls two further classes, FeatureTransform and CategoricalTransform, to perform variable preprocessing such as encoding categorical data.
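The wrapper pattern described above can be sketched as follows. The class names (ClassifierModel, ModelWrapper) come from the project, but the bodies are simplified stand-ins: a hypothetical majority-class "wrapper" replaces the real XGBWrapper/LGBWrapper/CatBoostWrapper, and the preprocessing classes are omitted.

```python
# Illustrative sketch of the wrapper pattern in GBClassifier.py.
# The actual project wrappers call XGBoost, LightGBM, or CatBoost;
# the stand-in below only exists so the sketch is self-contained.

class ModelWrapper:
    """Common interface every boosting-library wrapper implements."""
    def fit(self, X, y):
        raise NotImplementedError
    def predict(self, X):
        raise NotImplementedError

class MajorityClassWrapper(ModelWrapper):
    """Hypothetical minimal wrapper: always predicts the most
    frequent training class (a placeholder for a real booster)."""
    def fit(self, X, y):
        self.majority_ = max(set(y), key=list(y).count)
        return self
    def predict(self, X):
        return [self.majority_] * len(X)

class ClassifierModel:
    """Fits the supplied wrapper and reports a simple accuracy score."""
    def __init__(self, wrapper):
        self.wrapper = wrapper
    def fit(self, X, y):
        self.wrapper.fit(X, y)
        return self
    def accuracy(self, X, y):
        preds = self.wrapper.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)

# Any wrapper honouring the interface can be swapped in unchanged.
model = ClassifierModel(MajorityClassWrapper()).fit([[0], [1], [2]], [1, 1, 0])
print(model.accuracy([[0], [1], [2]], [1, 1, 0]))  # majority class is 1 -> 2/3
```

The design choice is that ClassifierModel never depends on a specific boosting library; only the wrapper does, which is what makes XGBWrapper, LGBWrapper, and CatBoostWrapper interchangeable.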

There are also other unused Python scripts, such as an LSTM for regression and a binary classification algorithm, which were not implemented in the final analysis.

The .Rprofile file is a regular R script that is run once when the project is opened. It contains the call library("workflowr"), ensuring that workflowr is loaded automatically each time the workflowr project is opened.

Development Environment

  • Python 3 (tested with Python 3.8.1)

  • R 3.6

  • R and Python code are used together through the reticulate package in RStudio. I highly recommend trying this feature, which integrates these programming languages seamlessly.

The recommended installation is as follows: with R and Python installed, download RStudio and install all packages listed in the code/setup.R and code/setup.py scripts. The most likely scenario is that there will be library issues when installing the R packages. These can be solved by reading the error messages and locating which system libraries are missing. For instance, "geos" is a library commonly missing on both Arch Linux and Ubuntu.

Theory


Data Labeling

Machine health is inversely related to the number of engine cycles: as the number of engine cycles increases, the machine health should decrease. This could be modelled as a linear function, but here we use a piece-wise linear function: we assume the machine has maximum health during the first few cycles, after which health starts to decrease linearly.
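The piece-wise linear labelling can be sketched as below. The knee value max_rul = 130 and the failure cycle are illustrative assumptions for this sketch; the source does not state the exact values used.

```python
# Sketch of the piece-wise linear health label described above.
# max_rul (the "knee" where the flat segment ends) is an assumed
# parameter, not a value taken from the project.

def health_label(cycle, failure_cycle, max_rul=130):
    """Remaining useful life capped at max_rul: constant over the
    first cycles, then decreasing linearly to zero at failure."""
    return min(max_rul, failure_cycle - cycle)

# An engine that fails at cycle 200: flat early, then linear decay.
labels = [health_label(c, failure_cycle=200) for c in (1, 50, 100, 199, 200)]
print(labels)  # [130, 130, 100, 1, 0]
```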

Test data generation

If we concatenate all the training labels and testing labels, it looks like (a). It is clear that the training labels always go down to zero TTF, but the testing labels need not reach zero TTF. Therefore, the model must see something like the testing data shown in (c) to produce a physically meaningful prediction.

RUL: Remaining Useful Life.

TTF: Time To Failure.

Heat map & dendrogram

The easiest way to understand a heat map is to think of a cross table or spreadsheet which contains colors instead of numbers. The default color gradient sets the lowest value in the heat map to dark blue, the highest value to a bright red, and mid-range values to light gray, with a corresponding transition (or gradient) between these extremes. Heat maps are well-suited for visualizing large amounts of multi-dimensional data and can be used to identify clusters of rows with similar values, as these are displayed as areas of similar color.

It is often useful to combine heat maps with hierarchical clustering, which is a way of arranging items in a hierarchy based on the distance or similarity between them. The result of a clustering calculation is presented either as the distance or the similarity between the clustered items, depending on the selected distance measure. To learn more about hierarchical clustering in general, see Overview of Hierarchical Clustering Theory. You can cluster both rows and columns in the heat map. The result of a hierarchical clustering calculation is displayed in a heat map as a dendrogram, which is a tree structure of the hierarchy. Row dendrograms show the distance (or similarity) between rows and which nodes each row belongs to as a result of the clustering calculation. Column dendrograms show the distance (or similarity) between the variables (the selected cell value columns). The example below shows a heat map with a row dendrogram where the distance between the rows was calculated.
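As a rough sketch of the row clustering behind such a dendrogram, the following uses SciPy's hierarchical clustering on a toy matrix of four heat-map rows; the tool and data here are illustrative assumptions, since the source does not specify which software computed its dendrogram.

```python
# Minimal sketch of the row clustering behind a heat-map dendrogram.
# The toy matrix and the choice of SciPy are ours, not from the source.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four "rows" of a heat map: two pairs with similar values, which a
# heat map would display as two areas of similar colour.
rows = np.array([[0.0, 0.1],
                 [0.1, 0.0],
                 [5.0, 5.1],
                 [5.1, 5.0]])

# Average-linkage clustering on Euclidean row distances; the linkage
# matrix Z is exactly what a dendrogram drawing routine visualises.
Z = linkage(rows, method="average", metric="euclidean")

# Cutting the tree into two clusters recovers the two similar pairs.
clusters = fcluster(Z, t=2, criterion="maxclust")
print(clusters)
```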

Shapley Values

The summary plot combines feature importance with feature effects. Each point on the summary plot is a Shapley value for a feature and an instance. The position on the y-axis is determined by the feature and on the x-axis by the Shapley value. The color represents the value of the feature from low (blue) to high (red). Overlapping points are jittered in y-axis direction, so we get a sense of the distribution of the Shapley values per feature. The features are ordered according to their importance.
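To make precise what each point in the summary plot represents, here is an exact Shapley-value computation for a toy two-feature additive model; the model, coefficients, and helper function are purely illustrative and not taken from the project.

```python
# Exact Shapley values for a tiny two-feature model: each point in a
# summary plot is one such value for one feature and one instance.
# The toy model below is an assumption made for this sketch.
from itertools import permutations

def shapley(value, players):
    """Average marginal contribution of each player over all orderings."""
    totals = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for order in perms:
        seen = set()
        for p in order:
            # marginal contribution of p given the players already present
            totals[p] += value(seen | {p}) - value(seen)
            seen.add(p)
    return {p: t / len(perms) for p, t in totals.items()}

# Additive model f = 2*x1 + 3*x2 evaluated at x1 = x2 = 1, with an
# "absent" feature contributing 0: Shapley splits f exactly into 2 and 3.
coef = {"x1": 2.0, "x2": 3.0}
value = lambda coalition: sum(coef[p] for p in coalition)
print(shapley(value, ["x1", "x2"]))  # {'x1': 2.0, 'x2': 3.0}
```

For additive models the Shapley values coincide with the per-feature contributions, which is why the toy example decomposes cleanly; for gradient-boosted trees, libraries estimate these values efficiently rather than enumerating coalitions.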

 




A work by